5. Decision Trees- Classifier

Aim

    To write a Python program that uses the scikit-learn module to build a decision tree classifier for the Iris flower dataset.

Understand the Decision Trees- Classifier Before You Begin

Overview: A decision tree is a supervised machine learning algorithm that can be used for both classification and regression tasks; this experiment uses the classification variant. It works by recursively splitting the dataset into subsets based on feature values, creating a tree-like model in which each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label (or, for regression, a predicted value).

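The algorithm selects the best split at each node using an impurity criterion such as Gini impurity or entropy. With entropy, the quality of a split is measured by its information gain, which is the reduction in entropy from the parent node to its child nodes. In both cases the goal is to make the child nodes as pure as possible. Because the resulting rules are easy to read, decision trees are widely used for medical diagnosis, credit risk assessment, customer churn prediction, and feature selection.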
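As a rough illustration of these split criteria, the sketch below computes Gini impurity, entropy, and information gain for a small hand-made node. The helper names (gini, entropy, information_gain) and the toy labels are made up for this example and are not part of scikit-learn.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy node: 4 samples of class "a" and 4 of class "b".
parent = ["a"] * 4 + ["b"] * 4

# A perfect split sends each class to its own child.
perfect_gain = information_gain(parent, ["a"] * 4, ["b"] * 4)

# A weaker split leaves both children mixed.
mixed_gain = information_gain(parent, ["a", "a", "a", "b"], ["a", "b", "b", "b"])

print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
print(perfect_gain)     # 1.0
print(mixed_gain)       # smaller than perfect_gain
```

The perfect split removes all uncertainty, so its gain equals the parent's entropy. The mixed split leaves impure children and therefore scores lower, which is why a tree learner would prefer the first split.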

Further Understanding: Decision Trees

Algorithm

  1. Load the Dataset: Load the Iris dataset using load_iris.
  2. Select Features and Labels: Extract the feature measurements as the feature set X, and the species labels as the target y.
  3. Split the Dataset: Split the dataset into training and testing sets using train_test_split, ensuring class distribution is maintained with stratify=y.
  4. Create the Classifier: Create a DecisionTreeClassifier and choose the split criterion (gini or entropy).
  5. Train the Model: Fit the classifier to the training data.
  6. Predict: Use the fitted classifier to predict the class labels of the testing set.
  7. Evaluate the Model: Compute the accuracy and the confusion matrix from the predicted and true labels.
  8. Visualize the Tree: Display the structure of the fitted decision tree to show the learned splitting rules.
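One way the decision tree experiment stated in the Aim could look in code is sketched below. This is a minimal sketch, not the lab's reference solution: the 70/30 split, max_depth=3, random_state=42, and the use of all four features are assumptions chosen for illustration, and export_text prints the tree as text in place of a graphical plot.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data: X holds the measurements, y holds the species labels.
iris = load_iris()
X, y = iris.data, iris.target

# Stratified split keeps the class proportions equal in both sets.
# test_size and random_state are assumed values, not prescribed by the lab.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Shallow tree with the Gini criterion (max_depth=3 is an assumption).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out samples.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(cm)

# Text rendering of the learned splitting rules.
print(export_text(clf, feature_names=iris.feature_names))
```

Changing criterion to "entropy" or removing max_depth lets you compare how the split criterion and tree depth affect the learned rules and the test accuracy.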

About Iris Dataset

The dataset consists of petal and sepal measurements for 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width.

Dataset Information

Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm
  • Classes: 3 (Iris-Setosa, Iris-Versicolour, Iris-Virginica)

    Source: Dataset Link
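The dataset facts above can be checked directly against the copy bundled with scikit-learn. The snippet below is a small sketch that assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 samples (rows) by 4 measurements (columns).
print(iris.data.shape)  # (150, 4)

# Column names and class names as stored by scikit-learn.
print(iris.feature_names)
print(iris.target_names)

# Number of samples per class.
class_counts = np.bincount(iris.target)
print(class_counts)  # [50 50 50]
```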

    Visualization

    Interactive Visualization of Decision Trees- Classifier.

    Pre-Lab Questions

    1. Why are decision trees preferred for use in ensemble approaches?
    2. What is Information Gain? How is it related to decision trees?
    3. What is the Gini index? How is it related to decision trees?

    Post-Lab Questions

    1. Apply standard scaling to the dataset and give your observations/comments on the performance.
    2. Run the code on the wine dataset from the scikit-learn module and present the results.

    Result

    The decision tree classifier was successfully implemented on the Iris dataset. The model achieved high accuracy, and the resulting tree structure and confusion matrix clearly demonstrated effective classification across all three iris flower species.